ABSTRACT

In this work, we propose a framework for foreground representation in video
and illustrate it with a multi-camera people matching application. We first
decompose the video into foreground and background. A low-level coarse
segmentation of the foreground is then used to generate a simple graph
representation. A vertex in the graph represents the “appearance” of a
corresponding segment in the foreground, while the relationship between two
segments is encoded by an edge between the corresponding vertices. This
provides a simple yet powerful and general representation of the foreground,
which can be very useful in problems such as people detection and tracking.
We illustrate the effectiveness of this model using an “example based query”
type of application for people matching in videos. Matching results are
provided in multiple-camera situations and also under occlusion.

Index  Terms —  Video,  Image  matching,  Image  analysis,  Machine 
vision. 


1. INTRODUCTION AND PREVIOUS WORK

The efficient representation of foreground objects in videos is a significant
component of important computer vision applications such as tracking,
detection, matching, and video-indexing. Most often, the representations
proposed in the literature are highly task or situation specific, involve
computationally prohibitive offline training, and do not effectively handle
changes in scale, pose, or background. In this paper we propose the use of a
graphical model to represent foreground objects in a video. The automatically
detected foreground is (coarsely) segmented into connected segments or
“superpixels” (borrowing a term from [1]). Each of these segments is
considered as a vertex in our graphical model, and the relationship between
segments is represented by an edge. This model helps to capture the local
information, i.e., the appearance of the segments in the foreground, without
any explicit shape model. The representation of interaction between segments
using edges allows us to incorporate spatial inter-relationships without
using an absolute point of reference.


Previous work on the representation of foreground objects (people) in a video
scene comes from two main areas in computer vision, namely, tracking and pose
detection. The Hydra system [2] represents foreground people as a combination
of a “head-detector” and an intensity based template correlation. This
requires the head of the person to always be part of the silhouette, and the
appearance template uses the head center as a spatial origin. McKenna et al.
[3] represent different people in the foreground using a color histogram of
the person. As shown in [4], such histogram based representations cannot
discriminate correctly between two objects (as they can have the same color
distribution) without additional spatial information. In [5], blobs
corresponding to people in the foreground are assigned to different body
parts (head and hands) to track single individuals in a scene. The authors of
[6] segment people (in a single camera) under occlusion by representing them
as a group of 3 segments (head + torso + limbs), and then estimating the best
arrangement for people in the scene using maximum-likelihood estimation. The
head of the person is assumed to be visible throughout the occlusion in order
to estimate the origin for the appearance model. Work in [7] reports the use
of a person model learned offline to detect and represent the people in the
foreground, and an appearance model of the person is learned online for
tracking a particular person. Recent works on pose estimation like [8]
utilize a loose limbed model to represent the 3-D pose of a person in the
foreground. Motion capture data aligned with a coordinate frame of the
calibrated cameras is used to estimate and detect the loosely connected limbs
with a non-parametric belief propagation algorithm. In [9], the authors use a
Bayesian framework to combine pictorial structure spatial models with hidden
Markov temporal models to represent a person in a video.


In this paper we propose a model for foreground representation which combines
low-level segmentation and spatial reasoning in a meaningful way. This
framework is intuitively ideal and very flexible for high level tasks in
foreground analysis, foreground indexing and retrieval, and other
applications in multiple-camera scenarios. We use an example based people
matching application to demonstrate the practical utility of our model. The
remainder of this paper is organized as follows: Section 2 gives a brief
description of the proposed graphical representation. The matching
application is detailed in Section 3. Section 4 discusses the matching
results and implementation issues, and finally we conclude with future
directions in Section 5.


Fig. 1. Foreground from the original frame (left) is segmented (center) using
the algorithm in [10] to generate a graphical model (right).


2. THE FOREGROUND MODEL

Encoding the local appearance as well as spatial relationships between
objects in a scene has always been a very challenging problem in computer
vision. In this paper we propose a graph structure to represent the
foreground in a scene as a combination of local appearance models of image
segments (vertices) along with a model of the relationship between them
(edges). Figure 1 depicts the proposed model. The graphical model is
generated after two preprocessing steps: (a) Foreground/Background
separation, and (b) Foreground Segmentation.

The video is first decomposed into “foreground+background” by using our
layering based foreground detection technique detailed in [11]. This
decomposition is very robust to motion in the background (moving trees, water
ripples, shaky camera, etc.) and provides a real-time foreground detection
capability. Once the foreground is detected, we perform a low-level
color-based segmentation of the foreground objects using the algorithm
proposed in [10] (other attributes beyond color could be used as well). Each
segment in the foreground is then represented by a vertex in our graphical
model. The space (and/or time) relationships between the segments are modeled
by (bi-directional) graph edges. It should be noted that this representation
differs from a typical graph: the edge is not just a connecting mechanism
between two vertices carrying some weight. Instead, the edge models the
relationship between the two segments, just as the vertex models the
appearance of the segment. This framework can thus allow for inline learning
and tracking of the various components in the scene. As a preliminary
investigation, the following sections demonstrate the capability of this type
of foreground representation in addressing the difficult problem of example
based people matching in multiple camera scenarios.
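
The construction described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the paper: it assumes the segmentation is already available as an integer label map with 0 marking background, and it treats two segments as connected when they share a 4-neighbor pixel border. The names `ForegroundGraph` and `build_graph` are hypothetical.

```python
from dataclasses import dataclass, field
import numpy as np


@dataclass
class ForegroundGraph:
    """Vertices are segment labels; edges join segments that touch."""
    centroids: dict = field(default_factory=dict)  # label -> (row, col)
    edges: set = field(default_factory=set)        # {(a, b)} with a < b

    def neighbors(self, label):
        """Sorted labels of the segments adjacent to `label`."""
        return sorted(b if a == label else a
                      for a, b in self.edges if label in (a, b))


def build_graph(labels: np.ndarray) -> ForegroundGraph:
    """Build the graph from an integer label map (assumption: 0 = background)."""
    graph = ForegroundGraph()
    for lab in np.unique(labels):
        if lab == 0:
            continue
        rows, cols = np.nonzero(labels == lab)
        graph.centroids[int(lab)] = (float(rows.mean()), float(cols.mean()))
    # Compare each pixel with its right neighbor, then with its lower neighbor.
    pairs = ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :]))
    for a, b in pairs:
        touching = (a != b) & (a != 0) & (b != 0)
        for u, v in zip(a[touching], b[touching]):
            graph.edges.add((int(min(u, v)), int(max(u, v))))
    return graph
```

In this sketch a vertex only stores a centroid; the appearance model of a segment and the relationship model of an edge would be attached to the same structure.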


3.1. Problem Setup and Similarity Measures

Consider the situation in Figure 2. The foreground in the example frame has
been segmented into vertices $A_i$ connected by the edges $e_{A_{ij}}$, where
$i, j \in \{1, 2, \ldots, N_A\}$ and $i \neq j$, $N_A$ being the number of
vertices in the example (segments in the foreground). The edge connection
rule is very simple: we connect two vertices by an edge only when the
corresponding segments are connected. Each segment in the foreground is
modeled using the color histogram¹ generated using non-parametric Kernel
Density Estimation (refer to [12]). We also use the SIFT features (refer to
[13]) detected inside the segment to model its appearance.

The similarity $\delta(A \to B)$ of the segment (vertex) $A$ in the example
frame to a segment $B$ in the queried frame is computed as a product of the
color and feature similarities:

$$\delta(A \to B) = \rho(A, B)\, f(A \to B), \qquad (1)$$

where $\rho(A, B)$ is the Bhattacharyya coefficient [14] and $f(A \to B)$ is
the feature similarity:

$$\rho(A, B) = \int \sqrt{p_A(x)\, p_B(x)}\, dx, \qquad (2)$$

and

$$f(A \to B) = \frac{1}{M_{A_{sift}}}\, e^{-d(\mathrm{sift}_A,\, \mathrm{sift}_B)}, \qquad (3)$$

where $p_A(x)$ and $p_B(x)$ are the density estimates for pixel $x$ computed
from the color distributions in segments $A$ and $B$ respectively,
$d(\mathrm{sift}_A, \mathrm{sift}_B)$ is the sum of squared Euclidean
distances between the features in $B$ that are the best matches for features
in $A$, and $M_{A_{sift}}$ is the number of SIFT features in the segment $A$.
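
As an illustration of Equations (1) to (3), the sketch below evaluates the vertex similarity on toy inputs. It rests on several assumptions that are not part of the paper: the color densities are discretized into histogram bins (so the integral becomes a sum), the SIFT descriptors are given as rows of a NumPy array and scaled so that the distance $d$ stays small, and a segment without features contributes zero feature similarity. All function names are hypothetical.

```python
import numpy as np


def bhattacharyya(p_a: np.ndarray, p_b: np.ndarray) -> float:
    """Discrete form of Eq. (2): sum over bins of sqrt(p_a * p_b)."""
    p_a = p_a / p_a.sum()
    p_b = p_b / p_b.sum()
    return float(np.sum(np.sqrt(p_a * p_b)))


def feature_similarity(desc_a: np.ndarray, desc_b: np.ndarray) -> float:
    """Eq. (3): exp(-d) / M_A, where d sums, over the descriptors of A, the
    squared distance to the best-matching descriptor of B."""
    if len(desc_a) == 0 or len(desc_b) == 0:
        return 0.0  # assumption: no features means no feature evidence
    diffs = desc_a[:, None, :] - desc_b[None, :, :]
    sq_dist = np.sum(diffs ** 2, axis=2)  # shape (M_A, M_B)
    d = sq_dist.min(axis=1).sum()
    return float(np.exp(-d) / len(desc_a))


def vertex_similarity(p_a, p_b, desc_a, desc_b) -> float:
    """Eq. (1): product of the color and feature similarities."""
    return bhattacharyya(p_a, p_b) * feature_similarity(desc_a, desc_b)
```

Because the best match is searched from $A$ to $B$ only, the measure is directional, which is consistent with the arrow notation $A \to B$.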

Apart from this similarity measure, we also use an edge or relationship
similarity measure $\Psi(e_{A_{ij}} \to e_{B_{kl}})$. The spatial
relationship between two connected segments/vertices in the example graph is
modeled by the graph edge as a simple 2-D Gaussian
$\mathcal{N}(\mu_r, \mu_\theta;\, \sigma_r, \sigma_\theta)$, where $\mu_r$ is
the magnitude of the vector connecting the centroids of the adjacent segments
(vertices), $\mu_\theta$ is the angle between this vector and the horizontal
axis, and $\sigma_r$ and $\sigma_\theta$ are the allowed variances
respectively.² Thus, the similarity of the relationship between two vertices
in the example frame ($e_{A_{ab}}$) with respect to the relationship between
two vertices in the query frame ($e_{B_{cd}}$) is computed as:

$$\Psi(e_{A_{ab}} \to e_{B_{cd}}) = \mathcal{N}(r_{cd}, \theta_{cd};\, \mu_r, \mu_\theta, \sigma_r, \sigma_\theta), \qquad (4)$$

where $r_{cd}$ is the magnitude of the vector connecting the centroids of the
adjacent vertices corresponding to $e_{B_{cd}}$, and $\theta_{cd}$ is the
angle of this vector with the horizontal axis.
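
One possible reading of the edge similarity is sketched below. The exact normalization of the Gaussian is not stated in the text, so this sketch assumes an unnormalized, separable Gaussian in $(r, \theta)$ that equals 1 for identical edges, treats $\sigma_r$ and $\sigma_\theta$ as standard deviations, and wraps the angular difference into $[-\pi, \pi]$. The default values are the ones quoted in the footnote on experimental settings; the function names are hypothetical.

```python
import math


def edge_params(centroid_a, centroid_b):
    """Return (r, theta): length of the vector from centroid_a to centroid_b
    and its angle with the horizontal axis. Centroids are (x, y) pairs."""
    dx = centroid_b[0] - centroid_a[0]
    dy = centroid_b[1] - centroid_a[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)


def edge_similarity(example_edge, query_edge,
                    sigma_r=5.0, sigma_theta=math.pi / 4):
    """Unnormalized separable Gaussian centered on the example edge."""
    mu_r, mu_theta = example_edge
    r, theta = query_edge
    # Assumption: compare angles on the circle, so +pi and -pi coincide.
    d_theta = math.atan2(math.sin(theta - mu_theta),
                         math.cos(theta - mu_theta))
    return math.exp(-((r - mu_r) ** 2) / (2 * sigma_r ** 2)
                    - d_theta ** 2 / (2 * sigma_theta ** 2))
```

Note that a fixed $\sigma_r$ in pixels makes this measure sensitive to scale changes between cameras; normalizing $r$ by the object size would be one way to relax that, although the text does not say whether this is done.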


3. ILLUSTRATIVE APPLICATION: EXAMPLE BASED PEOPLE MATCHING

Finding or matching a person viewed in one scene in the same or a completely
different setting is an important problem in applications such as
surveillance. In this work, we assume that we are given an “example” of an
isolated person (marked by the security guard, for example), and we want to
search for the person viewed from the same or a different (and
non-overlapping) camera location. To this end, we use the graphical
representation proposed above to convert this problem into a sub-graph
matching task (Figure 2).


3.2. Matching Algorithm

We utilize a greedy algorithm to perform a subgraph search as depicted in
Figure 2. The following is brief pseudo-code for our matching algorithm
(please refer to Figure 2):

• For each vertex, say $A_1$, in the example frame, compute the similarity
$\delta(A_1 \to B_k)$ using Equation (1) to all the vertices $B_k$ in the
query frame, and retain the best $M$ matches in a set of candidates
$C_{A_1} = \{B_k\}$, $k \in \{1, 2, \ldots, N_B\}$.

• Let the set of adjacent vertices of $A_1$ be denoted by
$\mathrm{Adj}(A_1)$. In Figure 2, $\mathrm{Adj}(A_1) = \{A_2\}$. For all
vertices $B_k$ in $C_{A_1}$, find $B_l \in \mathrm{Adj}(B_k)$ such that
$B_l \in C_{A_2}$. Now, compute the Global Evidence for matching $A_1$ to
$B_k$ as:

$$GE(A_1 \to B_k) = \delta(A_2 \to B_l)\, \Psi(e_{A_{12}} \to e_{B_{kl}}). \qquad (5)$$

• Assign $A_1$ to that vertex $B_k \in C_{A_1}$ which maximizes the Global
Evidence. The similarity score for this vertex assignment is computed as
$S_{A_1} = \delta(A_1 \to B_k) + GE(A_1 \to B_k)$. Similarly compute the
assignments for the remaining vertices in the example graph. The net
similarity score for the matched subgraph is given by
$\sum_{i=1}^{N_A} S_{A_i}$.

Fig. 2. People matching viewed as a sub-graph matching problem. (Panels:
foreground in example frame; foreground in query frame.)

¹ We use the r-g-S color-space as in [11].

² In all our experiments we have used $\sigma_r = 5$ pixels and
$\sigma_\theta = \pi/4$ radians.
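
The three steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the graphs are given as adjacency dictionaries, the similarity functions `delta` and `psi` are supplied by the caller, and, since the text describes a single neighbor only, the sketch takes the best supporting $B_l$ for each neighbor and sums the evidence over all neighbors. The name `greedy_match` and its parameters are hypothetical.

```python
def greedy_match(ex_adj, q_adj, delta, psi, M=3):
    """Greedy sub-graph matching.

    ex_adj, q_adj: dict mapping each vertex to the set of its adjacent
        vertices (example graph and query graph).
    delta(a, b): vertex similarity; psi((a, a2), (b, b2)): edge similarity.
    Returns (assignment dict, net similarity score).
    """
    # Step 1: keep the best M query candidates for every example vertex.
    cands = {a: sorted(q_adj, key=lambda b: delta(a, b), reverse=True)[:M]
             for a in ex_adj}
    assignment, net_score = {}, 0.0
    for a in ex_adj:
        best_b, best_ge = None, -1.0
        for b in cands[a]:
            # Step 2: global evidence contributed by the neighbors of a.
            # Assumption: best supporting B_l per neighbor, summed over
            # neighbors.
            ge = 0.0
            for a2 in ex_adj[a]:
                support = [delta(a2, b2) * psi((a, a2), (b, b2))
                           for b2 in q_adj[b] if b2 in cands[a2]]
                ge += max(support, default=0.0)
            # Step 3: keep the candidate with the largest global evidence;
            # ties go to the candidate with the higher vertex similarity.
            if ge > best_ge:
                best_b, best_ge = b, ge
        if best_b is not None:
            assignment[a] = best_b
            net_score += delta(a, best_b) + best_ge
    return assignment, net_score
```

With this rule, a query segment that looks slightly more like $A_1$ but has no neighbor resembling $A_2$ loses to a candidate whose neighborhood agrees with the example graph.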

4. RESULTS AND COMPARISON

Figures 3 and 4 illustrate the example based people matching algorithm
explained above. In Figure 3 the example person is matched to the foreground
in query frames acquired from the same camera location. It should be noted
that the greedy algorithm is able to make the correct matches in spite of
variations in pose and considerable occlusion. In Figure 4, the example
person is compared to the foreground in a completely different camera view,
leading to significant differences in scale and illumination. The results
show that our representation and matching algorithm can robustly handle
variations in illumination, pose, scale, and camera-view. We also perform a
coarse comparison with the covariance distance approach used in [15].³ We
first compute the average similarity score $S_{avg}$ and average covariance
distance $D_{avg}$ of the example in Figure 4(a) with respect to the same
person in the top three rows in Figure 4(b). We then compute the similarity
($S_{bad}$) and covariance distance ($D_{bad}$) to a bad match (last row in
Figure 4(b)). Now, we can approximately compare the discriminative power of
the two measures by comparing the ratios $V_S = S_{avg}/S_{bad}$ and
$V_D = D_{bad}/D_{avg}$.⁴ We observe that $V_S \approx 5.0$, whereas
$V_D \approx 1.5$ for the example in Figure 4, indicating that our similarity
measure is more discriminative and hence allows for more robust matching.
These results (including foreground detection and segmentation) were achieved
at run-times of less than 1 second per query frame, using non-optimized
experimental code, on a standard laptop computer with a 1.8GHz Centrino
Processor. We used a lite version of the SIFT feature extraction code
provided by [16]. Color density estimation was performed using the Improved
Fast Gauss Transform algorithm [12].

Fig. 3. (a) Automatically detected foreground (center) from the example frame
(left) is segmented (right) to get a 2 vertex graph representation. (b) The
segmentation of the query frames (left column) is shown with random colors
(center column) while the matching result is shown in the right column.

³ It should be noted that the covariance tracker proposed in [15] does not
separate foreground from background, which leads to less robust matching.
Because of our particular foreground representation, we are able to correctly
match the person along with giving exact segmentation for the matching
region.

⁴ Please note that we use a similarity measure, which is (conceptually)
inversely related to the notion of distance. Hence the use of reciprocal
ratios for $V_S$ and $V_D$.
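
The reciprocal form of the two ratios can be made concrete with a short sketch. The function name and the input values below are hypothetical and chosen for illustration only; they are not measurements from the experiments.

```python
def discriminability_ratios(s_avg, s_bad, d_avg, d_bad):
    """Return (V_S, V_D).

    A similarity should drop for a bad match while a distance should grow,
    so the similarity ratio is good/bad and the distance ratio is bad/good.
    In both cases a larger value means better separation.
    """
    return s_avg / s_bad, d_bad / d_avg


# Hypothetical scores, for illustration only.
v_s, v_d = discriminability_ratios(s_avg=2.0, s_bad=0.5, d_avg=4.0, d_bad=8.0)
print(v_s, v_d)
```

Writing both ratios so that larger is better is what allows a direct comparison between a similarity measure and a distance measure.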


5. CONCLUSION AND FUTURE DIRECTIONS

We presented a novel scheme utilizing real-time foreground/background
separation and low-level foreground segmentation to generate a graphical
model of the appearance and relationship between objects (or object parts) in
the foreground. The effectiveness of this representation was demonstrated
with an example based query type of people matching algorithm, which, along
with its simple set-up, provides state-of-the-art results. We plan to further
enhance the capability of our representation by improving the segment
relationship model using more features and also incorporating information
about temporal variations (temporal edges). Endeavors in these directions
will be reported elsewhere.

Fig. 4. (a) Foreground (center) from the example frame (left) is segmented
(right) to get a 3 vertex graph representation. (b) Top 3 rows: Note that the
example is compared with frames from a different camera location (left
column); our matching results are shown in the center column. The column on
the right is the rectangular window used to compute the covariance distance
as used in [15]. Last row: We compute the similarity score and covariance
distance between the example in (a) and the last row of (b), in order to
coarsely compare the discriminative power of our matching scheme with the
covariance distance.